Abstract: Description of the project and the findings

The dataset in question describes used cars listed in a popular catalog in Belarus. The dataset was cleaned and analyzed to answer several questions about the effect that certain variables have on the price at which cars are sold. Ultimately, Multiple Linear Regression, SVR, Decision Tree Regression, Random Forest Regression, and KNN were all used to predict the market price of a used car in Belarus given the attributes found in the dataset. The accuracy of these models was analyzed and their performance tested for real-world application with the help of training and testing datasets. Afterward, the models were compared to find the most effective model for predicting the price of a used car in Belarus. STATE OUR FINDINGS

Introduction: Overview of the problems/goals and how they will be solved

We went through many datasets to find one that we were all interested in. We chose this dataset because it had a large number of samples and a mixture of continuous and categorical variables. Since a car is such a big investment, we all wondered what affects its price. This is what sparked our initial questions and goals. By answering our proposed questions and meeting our goal, we hope to gain insight into what affects car prices, knowledge we can use later in life. To do so, we use R to produce graphs, and we back each graph with an appropriate statistical analysis to confirm that what it appears to show is accurate.

Initial Questions

The main goal is to create a predictive model based on insights gained from analyzing the impact of variables on the selling price of a vehicle. The goal was chosen for its applicability and for the general interest in which attributes affect the price of a used car. In addition, several questions about the dataset will be answered. These questions help in reaching the main goal and yield important insights of their own. The questions that will be answered about the dataset are as follows:

Data Analysis and Modeling:

Visualizations used to analyze data

To gain a thorough understanding of our dataset, several visualizations were used. Visualization of our data was done in two stages. Initially, each attribute was graphed on its own to gain a general understanding of how the samples are distributed for that attribute. This was done with the help of bar graphs, histograms, boxplots, the count function, and pie graphs.

# 1)What is the distribution of manufacturers?
ggplot(cars_edited, aes(y = manufacturer_name)) + geom_bar(aes(fill = manufacturer_name)) + geom_text(stat = "count", aes(label = after_stat(count)), hjust = 1)

# We can see a large difference in the number of cars for each manufacturer. Volkswagen, Opel, BMW, Audi, AvtoVAZ, Ford, Renault, and Mercedes-Benz are the major manufacturers.
# 2) A table to show unique car model names and quantity
View(cars_edited %>% count(model_name))

# 3) Plotting the number of cars with automatic or mechanical transmissions
transmissionGrouped <- group_by(cars_edited, transmission)
transmissionCounted <- count(transmissionGrouped)
percentTransmission <- paste0(round(100*transmissionCounted$n/sum(transmissionCounted$n), 2), "%")
pie(transmissionCounted$n, labels = percentTransmission, main = "Transmission Distribution", col = rainbow(nrow(transmissionCounted)))
legend("right", c("Automatic", "Mechanical"), cex = 0.8,
       fill = rainbow(nrow(transmissionCounted)))

# Mechanical is significantly more common than Automatic. This will definitely be an attribute to consider in our final model
# 4) Plotting cars by color and quantity
ggplot(cars_edited, aes(x = color)) + geom_bar(stat = "count", aes(fill = color)) + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)

# There is a lot of diversity in the colors and once again although there are categories with more values there is still a decent amount of variation

# 5) Histogram Odometer Value: Graph to see how the data is skewed
ggplot(cars_edited, aes(odometer_value)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The data is right-skewed, with what appear to be outliers at around 1,000,000 miles.

# 6) Histogram Year produced: Graph to see how the data is skewed
ggplot(cars_edited) + geom_histogram(aes(year_produced))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# The graph seems to be almost normally distributed, apart from what appear to be some outliers at the older end of the years
# 7) Graph to show what fuel distribution
ggplot(cars_edited, aes(x = engine_fuel)) + geom_bar(stat = "count", aes(fill = engine_fuel)) + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)

# There are only 2 major engine fuels (gasoline and diesel)
# 8) Pie graph to show engine type Distribution (Electric, Diesel, Gasoline)
TypeGrouped <- group_by(cars_edited, engine_type)
TypeCounted <- count(TypeGrouped) 
percentType <- paste0(round(100*TypeCounted$n/sum(TypeCounted$n), 2), "%")
pie(TypeCounted$n, labels = percentType, main = "Engine Type Distribution", col = rainbow(nrow(TypeCounted)))
legend("right", c("diesel", "electric", "gasoline"), cex = 0.8,
       fill = rainbow(nrow(TypeCounted)))

# Not surprisingly gasoline and diesel are the 2 most common Engine Type considering the fuel distribution

# 9) Table for Engine capacity
View(cars_edited %>% count(engine_capacity))
# Engine Capacity seems to be right-skewed, which may indicate outliers

# 10) Bar graph Body type: count how many cars have the same body type
ggplot(cars_edited, aes(x = body_type)) + geom_bar() + geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1)

# There is some diversity in body type and the diversity in categories may lend itself to useful data for a future model

# 11) Graph Drivetrain distribution:
drivetrainGrouped <- group_by(cars_edited, drivetrain)
drivetrainCounted <- count(drivetrainGrouped) 
percentdrivetrain <- paste0(round(100*drivetrainCounted$n/sum(drivetrainCounted$n), 2), "%")
pie(drivetrainCounted$n, labels = percentdrivetrain, main = "Drivetrain Distribution", col = rainbow(nrow(drivetrainCounted)))
legend("right", c("all", "front", "rear"), cex = 0.8,
       fill = rainbow(nrow(drivetrainCounted)))

# Although most vehicles are front-wheel drive, there are enough all-wheel and rear-wheel drive vehicles to gather some promising insights

# 12) Number of cars with same price
ggplot(cars_edited, aes(x = price_usd)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# This graph is extremely right-skewed. The sheer usefulness of price in our dataset will make this our main response attribute.

# 13) Pie Graph exchangeability Distribution
exchangeableGrouped <- group_by(cars_edited, is_exchangeable)
exchangeableCounted <- count(exchangeableGrouped) 
percentexchangeable <- paste0(round(100*exchangeableCounted$n/sum(exchangeableCounted$n), 2), "%")
pie(exchangeableCounted$n, labels = percentexchangeable, main = "Exchangeability Distribution", col = rainbow(nrow(exchangeableCounted)))
legend("right", c("False", "True"), cex = 0.8,
       fill = rainbow(nrow(exchangeableCounted)))

# Exchangeability is more common than anticipated. It will be interesting to see if pricier or cheaper cars consent to exchanges
# 14) Pie Graph Location region: Count the number of cars in a region
regionPriceDF <- group_by(cars_edited, location_region)
regionPriceDFCount <- count(regionPriceDF)
percentRegion <- paste0(round(100*regionPriceDFCount$n/sum(regionPriceDFCount$n), 2), "%")
pie(regionPriceDFCount$n, labels = percentRegion, main = "Region Price Distribution", col = rainbow(nrow(regionPriceDFCount)))
legend("right", c("Brest Region", "Gomel Region", "Grodno Region", "Minsk Region", "Mogilev Region", "Vitebsk Region"), cex = 0.8,
       fill = rainbow(nrow(regionPriceDFCount)))

# Minsk accounts for a very large share of the vehicles (which makes sense considering the population sizes), with the rest spread fairly evenly across the other regions.
# The usefulness of the attribute may be less since Minsk is such a large portion of the data.
# 15) Histogram Number of photos: Graph to see how the data is skewed
ggplot(cars_edited) + geom_histogram(mapping = aes(number_of_photos))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Graph is very right-skewed. I suspect photos may increase the value of a vehicle, but tests will need to be done on this

# 16) Box plot Number of photos: Graph to see how the data is skewed
ggplot(cars_edited) + geom_boxplot(mapping = aes(number_of_photos))

# There are many outliers. With extra time we may be able to investigate the impact of these outliers on the data.

# 17) Histogram Up counter: investigating how our outliers look with our modifications
ggplot(cars_edited) + geom_histogram(mapping = aes(up_counter))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Clearly outliers exist, since the x-axis has to span such a large scale.

# 18) Box plot Duration listed: investigating how our outliers look with our modifications
ggplot(cars_edited) + geom_boxplot(mapping = aes(duration_listed))

# There is a significant number of outliers. There is no evidence to conclude these should be eliminated.

# 19) Histogram Duration listed: Graph to see how the data is skewed
ggplot(cars_edited) + geom_histogram(aes(duration_listed))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
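
The direction of skew noted in the histogram comments can also be checked numerically rather than by eye. The sketch below is illustrative only: it uses a synthetic log-normal variable as a stand-in for a price-like attribute (the data, the seed, and the helper name `sample_skewness` are assumptions, not part of the original analysis). A positive value means the long tail is on the right, which also shows up as a mean larger than the median.

```r
# Moment-based sample skewness: positive = longer right tail,
# negative = longer left tail.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)  # assumed seed, only so the sketch is reproducible
# Synthetic stand-in for a price-like variable (NOT the cars data)
fake_price <- rlnorm(5000, meanlog = 8.5, sdlog = 0.8)

skew_value <- sample_skewness(fake_price)
mean_above_median <- mean(fake_price) > median(fake_price)

print(skew_value)          # positive for a right-skewed variable
print(mean_above_median)   # TRUE when the right tail pulls the mean up
```

The same helper could be applied to a real column, for example `sample_skewness(cars_edited$price_usd)`, assuming that data frame is loaded.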

Once a general understanding of each attribute was gained, we began to visualize various combinations of attributes. These visualizations were vital in seeing different relationships among our samples. The tools used were balloon plots, scatterplots, frequency polygons, dplyr::summarize, and boxplots.

# 1) Graph to show the amount of cars(by manufacturer name) in a region BALLOON PLOT
ggplot(cars_edited, aes(location_region, manufacturer_name)) + geom_count()

# Due to the quantity of categories a test will need to be done to gather significant data
# 2) Graph to show the price of a car according to its year produced SCATTER PLOT
ggplot(cars_edited, aes(year_produced, price_usd)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# There exists a parabolic relationship between year_produced and price_usd

# 3) Graph to show the number of cars in specific colors(10 red cars, 8 blue cars etc.) by region BAR GRAPH
ggplot(cars_edited, aes(color)) + geom_bar(aes(fill = location_region))

# From looking at the bar graph there does not seem to be any significant differences in color distribution for locations

# 4) Graph to show the price of a car according to its mileage (odometer) SCATTER PLOT
ggplot(cars_edited, aes(odometer_value, price_usd)) + geom_point(aes(color = is_exchangeable)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# This graph is incredibly diverse and indicates that there is a need for advanced models to assess price relationships.

# 5) Graph to show the price of a car according to its year produced AND body type SCATTER PLOT
ggplot(cars_edited, aes(year_produced, price_usd)) + geom_point(aes(color = body_type)) + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# There seems to be a parabolic relationship between year_produced and price_usd

# 6) Group by car body type and get its mean price
group_by(cars_edited, body_type) %>% summarise(price_mean = mean(price_usd))
## # A tibble: 12 x 2
##    body_type price_mean
##    <chr>          <dbl>
##  1 cabriolet     10976.
##  2 coupe          7458.
##  3 hatchback      4036.
##  4 liftback       7873.
##  5 limousine      8154.
##  6 minibus        8466.
##  7 minivan        6131.
##  8 pickup        11748.
##  9 sedan          5782.
## 10 suv           13768.
## 11 universal      5017.
## 12 van            6675.
# 7) Graph to show the outliers with body type and price BOX PLOT
ggplot(cars_edited) + geom_boxplot(mapping = aes(x = reorder(body_type, price_usd), y =
                                                   price_usd))

# 8) Graph to show the correlation between car body type, price, AND engine fuel
ggplot(cars_edited) + geom_point(mapping = aes(x = body_type, y = price_usd, color = engine_fuel))

# 9) Graph to show the price of a car according to its number of photos incl. engine fuel SCATTER PLOT
ggplot(cars_edited) + geom_point(mapping = aes(x = number_of_photos, y = price_usd, color = engine_fuel))

# 10) Group cars by manufacturer, and get its mean price
cars_edited %>% group_by(manufacturer_name) %>% summarize(mean(price_usd)) %>% View()

Several machine learning techniques were used in an attempt to create the most accurate prediction model. The models used in the project were as follows:

Multiple Linear Regression: This was used to gauge the impact of the continuous attributes on the price of the vehicle. The model is simple to understand and requires less computing power than more advanced models, which led to our decision to use it as our first model.

SVR: Used to see whether a model can handle every attribute and weight each one according to its impact on price. Although this model requires much more computing power, we believed that its robust nature lends itself to a high level of accuracy. Its effectiveness with categorical data and its ability to work with both linear and non-linear boundaries make it a prime model for our dataset. Furthermore, to optimize our SVR model we use linear, polynomial, and radial kernels and compare the results of the models that we generate.

Decision Tree (Partition): Used for predictive modeling. Since it is incredibly robust and relies on very few assumptions, we believed it would be able to handle possible outliers in our data and cope with the size of our dataset to produce an optimal model. Its simpler nature also makes it a model that would be preferred over other models (such as Random Forest Regression or KNN).

Random Forest Regression: Although Random Forest Regression is more complicated than a Decision Tree, since it combines multiple decision trees, it is still a crucial model for our dataset. We use it to see whether we can create an even more accurate model. If its accuracy turns out to be only marginally better, a Decision Tree may be preferred since it is easier to compute. Nevertheless, Random Forest Regression is a valuable tool for building a predictive model and worth testing for the sake of optimization.

KNN: K Nearest Neighbor was chosen due to its popularity and simplicity. KNN is able to handle our categorical as well as continuous attributes, and for that reason it is a good model for predicting price. It will be vital to compare this model with the others to see how useful it is for our dataset.
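
To make the KNN idea concrete, here is a minimal base-R sketch of k-nearest-neighbour regression. It is not the caret implementation used later in this report; the function name `knn_predict` and the four toy cars are assumptions made up for illustration. It shows the two steps that matter: standardising the predictors so that no attribute dominates the distance, and averaging the prices of the k closest training cars.

```r
# Minimal k-nearest-neighbour regression (illustrative sketch only).
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Standardise with the TRAINING means and standard deviations
  mu  <- colMeans(train_x)
  sdv <- apply(train_x, 2, sd)
  tr  <- scale(train_x, center = mu, scale = sdv)
  nw  <- scale(new_x,  center = mu, scale = sdv)
  apply(nw, 1, function(row) {
    # Euclidean distance from this new point to every training point
    diffs <- tr - matrix(row, nrow(tr), ncol(tr), byrow = TRUE)
    d <- sqrt(rowSums(diffs^2))
    # Predicted price = mean price of the k nearest neighbours
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy data (hypothetical values, not taken from the dataset)
train_x <- cbind(odometer = c(50000, 60000, 250000, 300000),
                 year     = c(2015, 2014, 2001, 2000))
train_y <- c(12000, 11000, 3000, 2500)
new_x   <- cbind(odometer = 55000, year = 2015)

knn_pred <- knn_predict(train_x, train_y, new_x, k = 2)
print(knn_pred)  # average of the two most similar cars: 11500
```

Categorical attributes have no natural distance, so they would first need to be turned into numeric indicator columns before a distance like this can be computed.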

PHASE 2 Portion: Describe the different ways you tried to answer the questions and goals

How did you reach your conclusions?

What graphs and techniques we chose to solve our questions and why the techniques/graphs were used.

Q: What impact does a region have on price?

A bar graph was used to show how the average price of a vehicle differed among the regions of Belarus. However, inspection alone could not tell us whether region had a significant impact on price. To test the impact of region on price we used the One-Way Anova test. Anova was used because the region attribute consists of several categories and we wished to see its impact on a single continuous variable.

Q: What is the distribution of manufacturers and whether manufacturers have a significant impact on the asking price of a vehicle?

Two bar graphs were used in answering this question. First, a bar graph was used to plot the distribution of vehicles across manufacturers. Then another bar graph was used to quickly inspect how average price varied between manufacturers. As useful as these graphs were in understanding the relationship between price and manufacturer, they were insufficient to establish a result. To show that a significant relationship exists between manufacturer and the asking price of a car, One-Way Anova was used. Once again we were dealing with categorical data with several categories, which lends the problem nicely to this sort of test.

Q: What is the relationship between odometer and price?

A scatter plot was used to quickly inspect for possible relationships between price and odometer. To get a more definite answer, several more methods were employed. First, a correlation test was performed to check for any correlation between price and odometer. Although correlation does not imply causation, we were still able to use this information to understand that there was some relationship between the variables. Afterward, it was natural to investigate how good a predictor odometer is of the price of a vehicle. To do this, a linear regression model was created to study the linear relationship and check the significance of the variable. We were later able to calculate the prediction error and draw conclusions from this information.
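
The two steps described here, a correlation test followed by a simple linear regression, can be sketched in base R as follows. The data are synthetic (the coefficients, noise level, and seed are assumptions chosen only to mimic a price that falls with mileage), so the numbers do not reproduce the report's results. The sketch also illustrates a useful identity: with a single predictor, the regression R-squared equals the squared Pearson correlation.

```r
set.seed(42)  # assumed seed, for reproducibility of the sketch only
n <- 500
# Synthetic stand-ins for odometer_value and price_usd
odometer <- runif(n, min = 0, max = 400000)
price    <- 12000 - 0.02 * odometer + rnorm(n, sd = 3000)

# Step 1: Pearson correlation with a significance test
ct <- cor.test(odometer, price)
r  <- unname(ct$estimate)

# Step 2: simple linear regression of price on odometer
fit       <- lm(price ~ odometer)
slope     <- unname(coef(fit)["odometer"])
r_squared <- summary(fit)$r.squared

print(r)          # negative: price falls as odometer rises
print(slope)      # estimated price change per odometer unit
print(r_squared)  # equals r^2 for a one-predictor model
```

Read this way, the R-squared of 0.1765 reported for the odometer model later in this report corresponds to a correlation of roughly 0.42 in absolute value.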

Q: Does the number of photos a vehicle has impact the selling price?

Graphs used: Scatter plot

How these graphs helped us solve the problem: The scatter plot, together with a line of best fit, let us see the relationship between the two variables.

Q: Does the number of times a vehicle has been upped in the catalog to raise its position impact the selling price?

Graphs used: Scatter plot

How these graphs helped us solve the problem: As before, the scatter plot let us see the relationship between the two variables, from which we could answer the question.

Q: Relationship between Engine Type and Body Type? What is the impact of Engine Type and Body Type on the selling price?

Graphs used: Mosaic

How these graphs helped us solve the problem: The mosaic plot gave us a colored graph showing the number of cars with each combination of body type and engine type. We were able to confirm the accuracy of the graph and come to a conclusion.

Q: What is the most popular model and whether we can conclude that the popularity of a model has a direct impact on the price of a vehicle?

Graphs used: Count, group by, and one-way anova

How these graphs helped us solve the problem: We did not need a graph for this question. It was very effective to group the data, count it, and inspect the resulting table.

Q: What is the average age of each vehicle manufacturer and whether the manufacturer changes how the production year impacts the selling price?

Graphs used: Scatter plot

How these graphs helped us solve the problem: The scatter plot was especially effective here. The answer was clear from a quick inspection of the graph.

Final outcomes and Analysis

Answer the goal here: In an attempt to create the most accurate and significant predictive model we created 6 different models which had a wide range of accuracy.
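
All of the models below are fit on train.data and evaluated on test.data, but the split itself is not shown in this report. The following is only a sketch of one common approach, a random 80/20 split in base R. The proportion, the seed, and the toy data frame `toy_cars` are assumptions for illustration and may differ from what was actually used.

```r
set.seed(123)  # assumed seed, for reproducibility of the sketch only

# Toy stand-in for cars_edited (hypothetical values)
toy_cars <- data.frame(
  price_usd      = runif(100, min = 500, max = 20000),
  odometer_value = runif(100, min = 0,   max = 400000)
)

# Draw 80% of the row indices at random for training
train_idx  <- sample(seq_len(nrow(toy_cars)),
                     size = floor(0.8 * nrow(toy_cars)))
train.data <- toy_cars[train_idx, ]
test.data  <- toy_cars[-train_idx, ]   # the remaining 20%

print(nrow(train.data))  # 80
print(nrow(test.data))   # 20
```

Keeping the test rows out of model fitting is what lets the RMSE and R2 values reported below be read as estimates of performance on unseen listings.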

Linear Regression Model

Firstly, we created a Multiple Linear Regression Model which utilized the attributes in our dataset that had continuous data. The code for this looks as follows:

LMCont <- lm(price_usd ~ odometer_value
             + year_produced
             + number_of_photos
             + duration_listed
             + up_counter
             , data = train.data)

step.LMConts <- LMCont %>% stepAIC(trace = FALSE)

Now that the model is created, it is vital to check its accuracy. To do this we used our test.data to test how accurately the Linear Regression model can predict the price of a vehicle.

# Predict using Multiple Linear Regression Model
LMContPrediction <- predict(step.LMConts, test.data)

# Prediction error, rmse
RMSE(LMContPrediction,test.data$price_usd) # RMSE is 4573.511
## [1] 4573.511
# Compute R-square
R2(LMContPrediction,test.data$price_usd) ## R^2 for test/train is 50.95891%
## [1] 0.5095891

From the code above we see that our RMSE is 4573.511, which represents an error rate of 4573.511/mean(test.data$price_usd) = 0.6860393, or about 68.6% of the average price, which is not good at all. Meanwhile, the R2 is 0.5095891, meaning that the observed and predicted outcome values are only moderately correlated, which is also not good. These results are not surprising and simply tell us that the price of a vehicle in Belarus depends on more than our continuous attributes alone. We shall proceed with more robust models in order to achieve a better result.
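
As a worked example of how these three numbers relate, the sketch below computes RMSE, the RMSE relative to the mean price, and a correlation-based R-squared by hand on four made-up observations. The helper names `rmse` and `r2_corr` and the toy values are assumptions for illustration; the R-squared here is the squared correlation between predictions and observations, which is our understanding of what caret's R2 reports by default.

```r
# Root mean squared error: typical size of a prediction error,
# in the same units as the response (here USD)
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# R-squared as the squared correlation between predicted and observed
r2_corr <- function(pred, obs) cor(pred, obs)^2

# Toy values (hypothetical): every prediction is off by exactly 1000
obs  <- c(1000, 5000, 9000, 13000)
pred <- c(2000, 4000, 10000, 12000)

rmse_value     <- rmse(pred, obs)          # 1000
relative_error <- rmse_value / mean(obs)   # 1000 / 7000, about 0.143
r2_value       <- r2_corr(pred, obs)

print(rmse_value)
print(relative_error)   # a proportion; multiply by 100 for a percentage
print(r2_value)
```

Note that the ratio RMSE/mean is a proportion, so it has to be multiplied by 100 before it is read as a percentage.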

After doing the Linear Regression Model we attempted to use SVR in order to create a more robust predictive model.

SVR Model

SVR is an extremely robust model which can handle our categorical data and should almost certainly achieve a better result than the Linear Regression Model. 3 SVR models were fitted, with varying accuracies. The kernels used with SVR were linear, polynomial, and radial. Linear was the first SVR model to be run and the code was as follows:

# Create SVR Model using Linear Method
modelSVRLinTrain <- train( price_usd ~ ., data = train.data, method = "svmLinear",
                           trControl = trainControl("cv", number =10),
                           preProcess = c("center", "scale"),
                           tuneLength = 10
)

summary(modelSVRLinTrain)
#Length  Class   Mode 
#1   ksvm     S4 
modelSVRLinTrain$bestTune
#C
#1 1

# Predict using SVR Model with Linear Method
modelSVRLinTrainPrediction <- predict(modelSVRLinTrain, test.data)

# Prediction error, rmse
RMSE(modelSVRLinTrainPrediction,test.data$price_usd)
#[1] 3257.887

# Compute R-square
R2(modelSVRLinTrainPrediction,test.data$price_usd)
#[1] 0.7772176

Afterward, the polynomial method was used with the SVR model:

# Create SVR Model using Polynomial Method
modelSVRPolyTrain <- train(price_usd ~ ., data = train.data, method = "svmPoly",
                           trControl = trainControl("cv", number =10),
                           preProcess = c("center", "scale"),
                           tuneLength = 10
)

summary(modelSVRPolyTrain)
modelSVRPolyTrain$bestTune

# Predict using SVR Model with Polynomial Method
modelSVRPolyTrainPrediction <- predict(modelSVRPolyTrain, test.data)

# Prediction error, rmse
RMSE(modelSVRPolyTrainPrediction,test.data$price_usd)

# Compute R-square
R2(modelSVRPolyTrainPrediction,test.data$price_usd)

Lastly, the radial method was used with the SVR model:

# Create SVR Model using Radial Method
modelSVRRadialTrain <- train(price_usd ~ ., data = train.data, method = "svmRadial",
                             trControl = trainControl("cv", number =10),
                             preProcess = c("center", "scale"),
                             tuneLength = 10
)

summary(modelSVRRadialTrain)
modelSVRRadialTrain$bestTune

# Predict using SVR Model with Radial Method
modelSVRRadialTrainPrediction <- predict(modelSVRRadialTrain, test.data)

# Prediction error, rmse
RMSE(modelSVRRadialTrainPrediction,test.data$price_usd)
#[1] 4752.231

# Compute R-square
R2(modelSVRRadialTrainPrediction,test.data$price_usd)
#[1] 0.5937837
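
For reference, the three SVR variants above differ only in the kernel, that is, in how the similarity between two (scaled) observations is measured. The sketch below writes the three kernels out in base R. The function names, the default parameter values, and the two toy vectors are assumptions for illustration; the formulas follow the usual definitions (dot product, shifted and scaled dot product raised to a power, and a Gaussian of the squared distance).

```r
# Linear kernel: plain dot product
linear_kernel <- function(x, y) sum(x * y)

# Polynomial kernel: (scale * <x, y> + offset)^degree
poly_kernel <- function(x, y, degree = 2, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree
}

# Radial (Gaussian) kernel: exp(-sigma * ||x - y||^2)
rbf_kernel <- function(x, y, sigma = 0.5) {
  exp(-sigma * sum((x - y)^2))
}

# Two toy observations (hypothetical, already centred and scaled)
a <- c(1, 2)
b <- c(3, -1)

k_lin  <- linear_kernel(a, b)   # 1*3 + 2*(-1) = 1
k_poly <- poly_kernel(a, b)     # (1 + 1)^2 = 4
k_rbf  <- rbf_kernel(a, b)      # exp(-0.5 * 13)

print(c(linear = k_lin, polynomial = k_poly, radial = k_rbf))
```

The radial kernel depends on sigma and the polynomial kernel on degree, scale, and offset, which is why those two variants have more tuning parameters to search over than the linear one.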

Even though the linear SVR model did much better than the Linear Regression Model, it was vital to investigate more models in an attempt to create an even more accurate one. The Decision Tree was a prime candidate since it is incredibly robust and relies on very few assumptions. Its simpler nature also makes it a model that would be preferred over other models (such as Random Forest Regression or KNN).

Decision Tree Regression Model

model_DT_Train <- train(price_usd ~ ., data = train.data, method = "rpart",
                        trControl = trainControl("cv",number = 10),
                        preProcess = c("center","scale"),
                        tuneLength = 10)

summary(model_DT_Train)
model_DT_Train$bestTune
plot(model_DT_Train)

# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(model_DT_Train$finalModel)
text(model_DT_Train$finalModel, digits = 3)
# Decision rules in the model
model_DT_Train$finalModel

# Make predictions on the test data
prediction_DT_Train <- model_DT_Train %>% predict(test.data)

# Prediction error, rmse
RMSE(prediction_DT_Train,test.data$price_usd)
#[1] 3245.413

# Compute R-square
R2(prediction_DT_Train,test.data$price_usd)
#[1] 0.7529956
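
To illustrate what a regression tree does at each node, the sketch below searches one numeric predictor for the cut point that minimises the total within-group sum of squared errors, which is the usual splitting criterion for regression trees. It is a simplified stand-in, not the rpart algorithm itself; the helper name `best_split` and the six toy cars are assumptions for illustration.

```r
# Find the single best split of y on a numeric predictor x,
# scoring each candidate cut by the summed squared error of both sides.
best_split <- function(x, y) {
  vals <- sort(unique(x))
  # Candidate cuts: midpoints between consecutive distinct values
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2
  sse  <- function(v) sum((v - mean(v))^2)
  cost <- sapply(cuts, function(cut) sse(y[x < cut]) + sse(y[x >= cut]))
  list(cut = cuts[which.min(cost)], sse = min(cost))
}

# Toy data (hypothetical): older cars are cheap, newer cars are expensive
year  <- c(1995, 1998, 2001, 2010, 2013, 2016)
price <- c(1000, 1200, 1500, 9000, 9500, 11000)

best <- best_split(year, price)
print(best$cut)  # 2005.5, the gap between the cheap and expensive groups
```

A full tree repeats this search over every predictor and then recursively within each resulting group, and the prediction for a leaf is the mean price of the training cars that fall in it.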

Random Forest Tree Model

random_forest_ranger <- train(price_usd ~ . ,
                              data = train.data,
                              method = "ranger",
                              trControl = trainControl("cv", number = 10),
                              preProcess = c("center","scale"),
                              tuneLength = 10
)

summary(random_forest_ranger)
random_forest_ranger$bestTune
plot(random_forest_ranger)


# A random forest is an ensemble of many trees, so there is no single final tree
# to plot; print a summary of the fitted forest instead
print(random_forest_ranger$finalModel)

# Make predictions on the test data
rf_predict_ranger <- predict(random_forest_ranger, test.data)

# Prediction error, rmse
RMSE(rf_predict_ranger,test.data$price_usd)


# Compute R-square
R2(rf_predict_ranger,test.data$price_usd)

KNN Model

model_knn <- train(
  price_usd ~., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center","scale"),
  tuneLength = 20
)

summary(model_knn$finalModel)
# Print the best tuning parameter k (the one that minimizes the cross-validated RMSE)
model_knn$bestTune
# Plot model performance vs different values of k
plot(model_knn)
# Make predictions on the test data
knn_predictions <- model_knn %>% predict(test.data)
head(knn_predictions)
# Compute the prediction error RMSE
RMSE(knn_predictions,test.data$price_usd)
# Compute R-square
R2(knn_predictions,test.data$price_usd)

Compare results obtained using different ML methods

Our choices: Linear Regression, Decision Trees, SVR (Linear and NonLinear), Random Forest Regression, and KNN
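
One simple way to line the models up is to collect the test-set metrics in a single table and sort it. The sketch below does this using only the RMSE and R2 values that are printed in the code output earlier in this report; the polynomial SVR, Random Forest, and KNN rows are left out because their results are not shown there. The data frame and variable names are assumptions for illustration.

```r
# Test-set metrics copied from the outputs reported above
results <- data.frame(
  model = c("Multiple linear regression", "SVR (linear)",
            "SVR (radial)", "Decision tree"),
  rmse  = c(4573.511, 3257.887, 4752.231, 3245.413),
  r2    = c(0.5095891, 0.7772176, 0.5937837, 0.7529956),
  stringsAsFactors = FALSE
)

# Lower RMSE is better, so sort ascending
results <- results[order(results$rmse), ]

best_by_rmse <- results$model[1]
best_by_r2   <- results$model[which.max(results$r2)]

print(results)
print(best_by_rmse)  # "Decision tree"
print(best_by_r2)    # "SVR (linear)"
```

On these four models the decision tree has the lowest RMSE and the linear SVR the highest R2, and the two are close on both metrics.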

What are the answers for the proposed questions?

Q: What impact does a region have on price?

A: Although the region does impact price, the extent of this impact would have to be assessed in a model that includes more attributes, because correlation does not imply causation. In other words, more attributes may be at play, and to better understand the impact of region on price it will be vital to assess the role region has in the overall models.

Q: What is the distribution of manufacturers and whether manufacturers have a significant impact on the asking price of a vehicle?

A: We can confirm that there is a relationship between the manufacturer and asking price. The manufacturer seems to be one of the larger factors in asking price, but it is not sufficient to predict price on its own.

Q: What is the relationship between odometer and price?

A: There is a low negative correlation between price and odometer.

Q: Does the number of photos a vehicle has impact the selling price?

A: There is a low positive correlation between the number of photos a vehicle has and the selling price.

Q: Does the number of times a vehicle has been upped in the catalog to raise its position impact the selling price?

A: The number of times a vehicle has been upped has a negligible impact on the selling price.

Q: Relationship between Engine Type and Body Type? What is the impact of Engine Type and Body Type on the selling price?

A: Sedan and Gasoline is the most common combination, followed by Hatchback and Gasoline. Engine Type and Body Type do have an impact on the selling price, but the extent of this impact will need to be investigated in our overall model.

Q: What is the most popular model and whether we can conclude that the popularity of a model has a direct impact on the price of a vehicle?

A: The most popular model is the Passat. Furthermore, the popularity of a model does seem to have an impact on the average price of a vehicle. However, more attributes would be needed to predict price.

Q: What is the average age of each vehicle manufacturer and whether the manufacturer changes how the production year impacts the selling price?

A:

#Group cars by manufacturer name
manufacturer_year <- group_by(cars_edited, manufacturer_name)

#Summarize the manufacturer years average
manufacturer_year_averages <- summarise(manufacturer_year, average = mean(year_produced, na.rm = TRUE))
# 1) Average age of each vehicle manufacturer
manufacturer_year_averages
## # A tibble: 55 x 2
##    manufacturer_name average
##    <chr>               <dbl>
##  1 Acura               2007.
##  2 Alfa Romeo          1999.
##  3 Audi                2000.
##  4 AvtoVAZ             1994.
##  5 BMW                 2003.
##  6 Buick               2014.
##  7 Cadillac            2006 
##  8 Chery               2011.
##  9 Chevrolet           2011.
## 10 Chrysler            2002.
## # … with 45 more rows

The manufacturer does change how the production year affects the selling price.

What are the justifications for your answers?

How the analysis of the graphs led to the answer:

Q: What impact does a region have on price?

Justification: To answer this question we decided to use a pie chart so we can see the average prices per region in comparison with each other.

Looking at the graph, it seems that only one region has an impact on the price of vehicles, namely the Minsk Region. However, we cannot conclude that from this graph alone; we need to confirm it with a one-way anova test.

group_by(regionPriceDF, regionPriceDF$location_region) %>%
  summarise(
    count = n(),
    mean = mean(price_usd, na.rm = TRUE),
    sd = sd(price_usd, na.rm = TRUE)
  )
## # A tibble: 6 x 4
##   `regionPriceDF$location_region` count  mean    sd
##   <chr>                           <int> <dbl> <dbl>
## 1 Brest Region                     2989 5091. 4652.
## 2 Gomel Region                     3140 5022. 4603.
## 3 Grodno Region                    2485 4745. 4223.
## 4 Minsk Region                    24193 7668. 7117.
## 5 Mogilev Region                   2678 4622. 4654.
## 6 Vitebsk Region                   3005 4870. 4450.

Looking at the results of the one-way anova we can confirm our assumed conclusion. Vehicle prices are not significantly different between regions, except for the Minsk Region. Thus, in this dataset, region does not have an impact on vehicle price outside of the Minsk Region.
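
The group summary above shows the means, but the ANOVA call for region is not shown in this report. The following is a minimal sketch of how such a test, and a follow-up comparison of individual pairs of groups, could be run in base R. It uses synthetic data with three made-up groups (the group labels, means, spread, and seed are assumptions), so its output says nothing about the actual regions.

```r
set.seed(7)  # assumed seed, for reproducibility of the sketch only

# Synthetic stand-in: groups A and B share a mean, group C is pricier
region <- factor(rep(c("A", "B", "C"), each = 200))
price  <- rnorm(600,
                mean = c(5000, 5000, 7500)[as.integer(region)],
                sd   = 4000)

# One-way ANOVA: does mean price differ between ANY of the groups?
fit     <- aov(price ~ region)
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]

# Tukey HSD: WHICH pairs of groups differ, with adjusted p-values
tukey <- TukeyHSD(fit)

print(p_value)
print(tukey$region)  # rows "B-A", "C-A", "C-B"
```

The overall F-test only says that at least one group mean differs. A pairwise procedure such as Tukey's HSD is what would support a statement of the form "only one region differs from the rest".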

Q: What is the distribution of manufacturers and whether manufacturers have a significant impact on the asking price of a vehicle?

Justification: To tackle this question we decided to use a bar graph, since we have both categorical and continuous data.

manuPriceDF <- group_by(cars_edited, manufacturer_name)
manuPriceDF_averages <- summarise(manuPriceDF, average_price_usd = mean(price_usd))
View(manuPriceDF_averages)
ggplot(manuPriceDF_averages, aes(x = average_price_usd, y = manufacturer_name)) + geom_bar(aes(fill = manufacturer_name),stat="identity") + geom_text(aes(label =  paste0("$",round(average_price_usd)), hjust = 1))

Looking at the bar graph, we can make an assumption about the data: the manufacturer of a vehicle has an impact on its asking price. Again, though, we cannot assume this is correct; we need to confirm it with a one-way anova test.

manuSumm <- group_by(manuPriceDF, manuPriceDF$manufacturer_name) %>%
  summarise(
    count = n(),
    mean = mean(price_usd, na.rm = TRUE),
    sd = sd(price_usd, na.rm = TRUE)
  )

# Compute the analysis of variance
res.aovTwo <- aov(manuPriceDF$price_usd ~ manuPriceDF$manufacturer_name,
                  data = manuPriceDF)

summary(res.aovTwo)
##                                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## manuPriceDF$manufacturer_name    54 2.917e+11 5.402e+09   160.1 <2e-16 ***
## Residuals                     38435 1.297e+12 3.374e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

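With F = 160.1 on 54 and 38,435 degrees of freedom (p < 2e-16), we can confirm that the mean asking price differs between manufacturers, so there is a relationship between the manufacturer and asking price.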
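A significant F test says that the manufacturer means differ, but not by how much. One common effect-size measure is eta squared, the share of the total sum of squares attributed to the factor. The sketch below computes it from the rounded sums of squares printed in the ANOVA table above, so the result should be treated as approximate.

```r
# Sums of squares copied from the ANOVA table (rounded as printed)
ss_manufacturer <- 2.917e11
ss_residual     <- 1.297e12

# Eta squared: proportion of total variation in price attributed to manufacturer
eta_sq <- ss_manufacturer / (ss_manufacturer + ss_residual)
round(eta_sq, 3)

# Cross-check the printed F value from the two mean squares
f_value <- (ss_manufacturer / 54) / (ss_residual / 38435)
round(f_value, 1)
```

With these inputs, manufacturer accounts for roughly 18% of the variation in price.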

Q: What is the relationship between odometer and price?

Justification: When we investigate this graph we initially notice that the line of best fit seems to be going down as the odometer value increases.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

However, the downward trend in the smoothed line alone doesn’t establish that a higher odometer value means a lower price. To test this we fit a linear regression and check how well that line fits the data. Using this we will come up with an answer to our question.

## `geom_smooth()` using formula 'y ~ x'

odometer_on_price <- lm(price_usd ~ odometer_value, data = cars_edited)
summary(odometer_on_price)
## 
## Call:
## lm(formula = price_usd ~ odometer_value, data = cars_edited)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11577  -3514  -1122   2064  40854 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.158e+04  6.203e+01  186.65   <2e-16 ***
## odometer_value -1.986e-02  2.186e-04  -90.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5830 on 38488 degrees of freedom
## Multiple R-squared:  0.1765, Adjusted R-squared:  0.1765 
## F-statistic:  8248 on 1 and 38488 DF,  p-value: < 2.2e-16
confint(odometer_on_price)
##                        2.5 %        97.5 %
## (Intercept)     1.145654e+04  1.169971e+04
## odometer_value -2.028451e-02 -1.942748e-02
sigma(odometer_on_price)*100/mean(cars_edited$price_usd)
## [1] 87.89188

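The slope is negative and highly significant: each additional unit on the odometer is associated with a price about $0.0199 lower (95% CI: 0.0194 to 0.0203, p < 2e-16). Thus we can conclude that, on average, the higher the odometer value, the lower the price of the vehicle. Note that the 87.89% figure is the residual standard error as a percentage of the mean price, so it measures prediction error rather than accuracy. Together with an R-squared of 0.1765, it shows that odometer value alone explains only about 18% of the variation in price.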
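The quantity sigma(model) * 100 / mean(y) used above is the residual standard error expressed as a percentage of the mean response, so smaller values indicate a better fit. The sketch below uses simulated data (the variable names and noise levels are made up) to show how it behaves next to R-squared for a tight fit and a noisy one.

```r
set.seed(42)
n <- 500
odometer <- runif(n, min = 0, max = 300000)

# Same underlying line, two noise levels (all values are made up)
price_tight <- 11500 - 0.02 * odometer + rnorm(n, sd = 50)
price_noisy <- 11500 - 0.02 * odometer + rnorm(n, sd = 5800)

fit_tight <- lm(price_tight ~ odometer)
fit_noisy <- lm(price_noisy ~ odometer)

# Residual standard error as a percentage of the mean response
relative_error <- function(fit, y) sigma(fit) * 100 / mean(y)

err_tight <- relative_error(fit_tight, price_tight)
err_noisy <- relative_error(fit_noisy, price_noisy)
r2_tight  <- summary(fit_tight)$r.squared
r2_noisy  <- summary(fit_noisy)$r.squared

round(c(err_tight = err_tight, err_noisy = err_noisy), 1)
round(c(r2_tight = r2_tight, r2_noisy = r2_noisy), 3)
```

The tight fit gives a relative error near zero and an R-squared near one, while the noisy fit gives a large relative error and a small R-squared, even though both share the same true slope.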

Q: Does the number of photos a vehicle has impact the selling price?

Justification: With our scatter plot graph we added a line of best fit for the data.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

From the graph it seems that the more photos of a vehicle there are, the higher the price will be. However, we can’t make that assumption based on the graph alone; we need to use linear regression to confirm it.

## `geom_smooth()` using formula 'y ~ x'

Now that we have the linear regression line, we need to check how well it fits the data.

number_of_photos_on_price <- lm(price_usd ~ number_of_photos, data = cars_edited)
summary(number_of_photos_on_price)
## 
## Call:
## lm(formula = price_usd ~ number_of_photos, data = cars_edited)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24082  -3884  -1585   2249  44082 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3418.275     58.159   58.77   <2e-16 ***
## number_of_photos  333.297      5.098   65.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6095 on 38488 degrees of freedom
## Multiple R-squared:  0.09997,    Adjusted R-squared:  0.09994 
## F-statistic:  4275 on 1 and 38488 DF,  p-value: < 2.2e-16
confint(number_of_photos_on_price)
##                     2.5 %    97.5 %
## (Intercept)      3304.283 3532.2680
## number_of_photos  323.305  343.2882
sigma(number_of_photos_on_price)*100/mean(cars_edited$price_usd)
## [1] 91.88472

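The slope is positive and highly significant: each additional photo is associated with a price about $333 higher (95% CI: $323 to $343, p < 2e-16). We can therefore conclude that vehicles with more photos tend to have a higher selling price. The 91.88% figure is the residual standard error relative to the mean price, an error rate rather than an accuracy, and the R-squared of about 0.10 shows that the number of photos explains only about 10% of the variation in price.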

Q: Does the number of times a vehicle has been upped in the catalog to raise its position impact the selling price?

Justification: Using a scatter plot with a line of best fit, it appears that the more a vehicle has been upped, the higher the price will be.

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Again, however, we can’t assume this to be true without fitting the linear regression and checking how well that line fits.

## `geom_smooth()` using formula 'y ~ x'

up_counter_on_price <- lm(price_usd ~ up_counter, data = cars_edited)
summary(up_counter_on_price)
## 
## Call:
## lm(formula = price_usd ~ up_counter, data = cars_edited)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14558  -4502  -1852   2305  43438 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6493.0457    34.9345  185.86   <2e-16 ***
## up_counter     8.5694     0.7548   11.35   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6413 on 38488 degrees of freedom
## Multiple R-squared:  0.003337,   Adjusted R-squared:  0.003311 
## F-statistic: 128.9 on 1 and 38488 DF,  p-value: < 2.2e-16
confint(up_counter_on_price)
##                   2.5 %     97.5 %
## (Intercept) 6424.573138 6561.51822
## up_counter     7.089888   10.04893
sigma(up_counter_on_price)*100/mean(cars_edited$price_usd)
## [1] 96.69137

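The slope is positive and statistically significant (about $8.57 per additional up, 95% CI: $7.09 to $10.05, p < 2e-16), so more ups are associated with a slightly higher selling price. However, the 96.69% figure is the residual standard error relative to the mean price, not an accuracy, and the R-squared of 0.0033 means the up counter explains well under 1% of the variation in price. We conclude that the effect is detectable in a sample this large but too weak to be useful for predicting price on its own.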
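The regression output above pairs a very small R-squared with a very small p-value. For a simple linear regression the two are linked through the sample size by F = R² / (1 − R²) × (n − 2), which the sketch below evaluates using the rounded figures from the summary. It is only an arithmetic illustration of why a weak relationship can still be highly significant with tens of thousands of rows.

```r
r_squared   <- 0.003337   # Multiple R-squared from the summary
df_residual <- 38488      # residual degrees of freedom (n - 2)

# F statistic implied by R-squared for a model with one predictor
f_stat <- r_squared / (1 - r_squared) * df_residual
round(f_stat, 1)

# Corresponding upper-tail p-value on 1 and df_residual degrees of freedom
p_value <- pf(f_stat, df1 = 1, df2 = df_residual, lower.tail = FALSE)
p_value < 2.2e-16
```

The implied F statistic agrees with the 128.9 reported in the summary, so the significance comes from the sample size rather than from a strong relationship.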

Q: What is the relationship between engine type and body type, and what is the impact of engine type and body type on the selling price?

Justification: For this question we chose to use a mosaic plot because both engine type and body type are categorical variables.

Based on this mosaic plot, the body types Pickup and Limousine appear to be the only ones whose mix of engine types differs noticeably from the rest. We can’t assume this, however, so again let’s verify it with a few tests.

# Two-way ANOVA: price by engine type, body type, and their interaction
body_engine_type_on_price.aov3 <- aov(price_usd ~ engine_type * body_type, data = cars_edited)
summary(body_engine_type_on_price.aov3)
##                          Df    Sum Sq   Mean Sq F value Pr(>F)    
## engine_type               2 1.301e+10 6.503e+09  204.12 <2e-16 ***
## body_type                11 3.457e+11 3.142e+10  986.32 <2e-16 ***
## engine_type:body_type    11 4.280e+09 3.891e+08   12.21 <2e-16 ***
## Residuals             38465 1.225e+12 3.186e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.tables(body_engine_type_on_price.aov3, type="means", se = TRUE)
## Design is unbalanced - use se.contrast() for se's
## Tables of means
## Grand mean
##          
## 6632.904 
## 
##  engine_type 
##     diesel electric gasoline
##       7412    17465     6237
## rep  12869       10    25611
## 
##  body_type 
##     cabriolet coupe hatchback liftback limousine minibus minivan pickup sedan
##         11309  7798      4179     8023      8550    7780    5889  11470  5947
## rep        75   652      7644      551        12    1368    3608    129 12977
##       suv universal  van
##     13782      4786 6036
## rep  5162      5504  808
## 
##  engine_type:body_type 
##            body_type
## engine_type cabriolet coupe hatchback liftback limousine minibus minivan pickup
##    diesel    6350     10120  4120      8239               8638    6897   10704 
##    rep          4        31  1571        96        0      1260    1956      74 
##    electric                 15213     26475                                    
##    rep          0         0     8         2        0         0       0       0 
##    gasoline 11236      7325  4000      7713     8154      6462    5224   13153 
##    rep         71       621  6065       453       12       108    1652      55 
##            body_type
## engine_type sedan suv   universal van  
##    diesel    6114 14959  5784      6942
##    rep       2554  1678  2934       711
##    electric                            
##    rep          0     0     0         0
##    gasoline  5701 13194  4141      4712
##    rep      10423  3484  2570        97
#Tukey HSD
TukeyHSD(body_engine_type_on_price.aov3)

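These tests address the second part of the question. Engine type (F = 204.12), body type (F = 986.32) and their interaction (F = 12.21) are all significant at p < 2e-16, so both variables affect the selling price, and the effect of engine type depends on body type. Note that the ANOVA models price, so it does not by itself confirm the mosaic-plot observation that Pickup and Limousine differ in engine type; the cell counts in the table of means are the evidence for that (for example, all 12 limousines are gasoline).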
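The association between body type and engine type can also be tested directly with a chi-squared test of independence. The sketch below is an illustration rather than part of the original analysis: it re-enters the diesel and gasoline cell counts (the `rep` rows) from the table of means above and leaves out the 10 electric cars, whose counts are mostly zero.

```r
body_types <- c("cabriolet", "coupe", "hatchback", "liftback", "limousine",
                "minibus", "minivan", "pickup", "sedan", "suv", "universal", "van")

# Cell counts copied from the `rep` rows of the engine_type:body_type table
diesel   <- c(4, 31, 1571, 96, 0, 1260, 1956, 74, 2554, 1678, 2934, 711)
gasoline <- c(71, 621, 6065, 453, 12, 108, 1652, 55, 10423, 3484, 2570, 97)

engine_by_body <- rbind(diesel = diesel, gasoline = gasoline)
colnames(engine_by_body) <- body_types

# Share of diesel vehicles within each body type
diesel_share <- round(diesel / (diesel + gasoline), 2)
names(diesel_share) <- body_types
diesel_share

# Chi-squared test of independence; R warns because the expected
# diesel count for limousines is below 5, so the warning is suppressed here
chi_result <- suppressWarnings(chisq.test(engine_by_body))
chi_result
```

The diesel shares show which body types stand out: minibuses and vans are mostly diesel, while limousines are entirely gasoline.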

Q: What is the most popular model, and can we conclude that the popularity of a model has a direct impact on the price of a vehicle?

Justification: Given the nature of this question, we need to organize some data before we can graph it.

models_sorted <- group_by(cars_edited, model_name)
#Find popularity of vehicles
models_counted <- cars_edited %>% count(model_name) %>% arrange(desc(n))

#Find average price for each model
models_sorted_averages <- summarise(models_sorted, average_price_usd = mean(price_usd))

#Get the count of each model by joining on model_name rather than relying on row order
models_sorted_avg_with_cnt <- models_sorted_averages %>%
  left_join(models_counted, by = "model_name") %>%
  dplyr::rename(counts = n)
models_sorted_avg_with_cnt$counts <- as.numeric(models_sorted_avg_with_cnt$counts)

Now that each model has both a count and an average price, we can graph it.
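The idea of pairing each model's count with its average price by model name can be shown on a small example. Below is a minimal base R sketch on a made-up data frame (cars_toy and its values are hypothetical); the original analysis uses dplyr.

```r
# Hypothetical toy data standing in for cars_edited
cars_toy <- data.frame(
  model_name = c("Passat", "Passat", "Passat", "Astra", "Astra", "Golf"),
  price_usd  = c(6000, 8000, 7000, 3000, 5000, 9000)
)

# Average price and number of listings per model
avg_price <- tapply(cars_toy$price_usd, cars_toy$model_name, mean)
listings  <- table(cars_toy$model_name)

# Index the counts by name so each count lines up with its model
model_summary <- data.frame(
  model_name        = names(avg_price),
  average_price_usd = as.numeric(avg_price),
  counts            = as.numeric(listings[names(avg_price)])
)
model_summary
```

Looking the counts up by name keeps the two columns aligned even if the two summaries were produced in different orders.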

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

From this scatter plot with a line of best fit, it is hard to tell what the relationship is. We can run a linear regression on this to be sure of what relationship there is.

## `geom_smooth()` using formula 'y ~ x'

Next we need to check how well the linear regression line fits.

modelPricePerCount <- lm(average_price_usd ~ counts, data = models_sorted_avg_with_cnt)
summary(modelPricePerCount)
## 
## Call:
## lm(formula = average_price_usd ~ counts, data = models_sorted_avg_with_cnt)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8887  -6096  -2217   2863  40964 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9045.783    271.631   33.30  < 2e-16 ***
## counts        -9.966      2.975   -3.35 0.000836 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8412 on 1116 degrees of freedom
## Multiple R-squared:  0.009954,   Adjusted R-squared:  0.009067 
## F-statistic: 11.22 on 1 and 1116 DF,  p-value: 0.0008361
confint(modelPricePerCount)
##                  2.5 %      97.5 %
## (Intercept) 8512.81827 9578.747230
## counts       -15.80376   -4.128411
sigma(modelPricePerCount)*100/mean(models_sorted_avg_with_cnt$average_price_usd)
## [1] 96.65859

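The slope is negative and statistically significant: each additional listing of a model is associated with an average price about $10 lower (95% CI: $4.13 to $15.80, p = 0.0008). However, the R-squared of about 0.01 means popularity explains only about 1% of the variation in average price across the 1,118 models, and the 96.66% figure is the residual standard error relative to the mean, not an accuracy. We can conclude that more popular models tend to be slightly cheaper, but popularity on its own is a weak predictor of the average_price of a vehicle.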

Q: What is the average age of vehicles from each manufacturer, and does the manufacturer change how the production year impacts the selling price?

Justification: With this question we decided to use a scatter plot and color all the points in the scatter plot with a color corresponding to a different manufacturer.

Again, we can’t rely on the graph alone; we need to confirm the results using a different method.

manufacturer_price <- aov(price_usd ~ manufacturer_name * year_produced, data = cars_edited)
summary(manufacturer_price)
##                                    Df    Sum Sq   Mean Sq F value Pr(>F)    
## manufacturer_name                  54 2.917e+11 5.402e+09   428.4 <2e-16 ***
## year_produced                       1 6.854e+11 6.854e+11 54355.5 <2e-16 ***
## manufacturer_name:year_produced    54 1.273e+11 2.357e+09   186.9 <2e-16 ***
## Residuals                       38380 4.840e+11 1.261e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With these results we can confirm our original assumption. The manufacturer_name:year_produced interaction is significant (F = 186.9, p < 2e-16), so the manufacturer does change how the production year affects the selling price.
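To make the interaction term concrete, the sketch below simulates two hypothetical manufacturers whose prices rise with production year at different rates and fits the same kind of model with lm(). All names and numbers here (maker, year_c, the slopes of 300 and 600) are invented for illustration and are not taken from the data set.

```r
set.seed(1)
n <- 200  # simulated cars per manufacturer

# Years since 1990 for two made-up manufacturers, A and B
toy <- data.frame(
  maker  = factor(rep(c("A", "B"), each = n)),
  year_c = sample(0:29, 2 * n, replace = TRUE)
)

# Price gains 300 per year for A and 600 per year for B, plus noise
true_slope <- ifelse(toy$maker == "A", 300, 600)
toy$price <- 1000 + true_slope * toy$year_c + rnorm(2 * n, sd = 1000)

# maker * year_c expands to maker + year_c + maker:year_c
fit <- lm(price ~ maker * year_c, data = toy)

# The interaction coefficient estimates how much steeper B's slope is than A's
interaction_coef <- coef(fit)[["makerB:year_c"]]
interaction_coef

# F test for the interaction term, as in the ANOVA table
interaction_p <- anova(fit)["maker:year_c", "Pr(>F)"]
interaction_p
```

A significant interaction here means the year slope differs between the two simulated manufacturers, which is the same reading applied to the manufacturer_name:year_produced row in the real ANOVA table.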